HBO

Ashley Wright & Mubeena Wahaj

2023-04-13

Lights, camera, action!

Today, we’re going to take a deep dive into the world of HBO movies and TV shows, from iconic dramas like The Sopranos and Game of Thrones to the latest releases. HBO has been providing quality content to its viewers for decades, but have you ever wondered how they decide which shows to produce or which movies to acquire? That’s where the fascinating world of HBO data comes into play. By analyzing audience trends, ratings, and viewer demographics, HBO can make informed decisions about what to offer its loyal fans. So sit back, grab a snack, and get ready to explore the exciting world of HBO data.

Packages Used:

# Install once if needed
#install.packages("magick")
#install.packages("kableExtra")
#install.packages("countrycode")

#load magick to process images
#load tidyverse to manipulate data
#load ggplot2 for graphing
#load shiny to...
#load dplyr to manipulate data
#load knitr for general-purpose literate programming
#load kableExtra to add features to tables
#load countrycode to convert between country codes and country names
#load maps to draw geographical maps

library(magick)
library(tidyverse)
library(ggplot2)
library(shiny)
library(dplyr)
library(knitr)
library(kableExtra)
library(countrycode)
library(maps)

About Our Data

The data we’ve decided to work on comes from Kaggle, where it is maintained by Diego Enrique. Here’s the link: https://www.kaggle.com/datasets/dgoenrique/hbo-max-movies-and-tv-shows

Titles data:

15 variables, 3030 observations

id: The title ID

title: The name of the title

type: TV show or movie (SHOW or MOVIE)

description: A description of movie or tv show

release_year: Year show/movie was released

age_certification: The age rating of movie or show

runtime: The length of the movie, or of an episode for a show, in minutes

genres: A list of genres

production_countries: Countries that produced the show/movie

seasons: Number of seasons IF it is a show

imdb_id: The title ID on IMDB

imdb_score: Score on IMDB

imdb_votes: Votes on IMDB

tmdb_popularity: Popularity on TMDB

tmdb_score: Score on TMDB

Credits data:

5 variables, 64879 observations

person_id: The person ID on JustWatch

id: The title ID on JustWatch

name: The name of actor or director

character: The name of the character played in the movie/show

role: ACTOR or DIRECTOR

Let us read our data, shall we?

credits = read.csv("credits.csv", stringsAsFactors = FALSE)
titles = read.csv("titles.csv", stringsAsFactors = FALSE)
glimpse(credits)
## Rows: 64,879
## Columns: 5
## $ person_id <int> 14701, 14702, 14703, 14704, 14705, 14706, 1367, 14716, 14707…
## $ id        <chr> "tm77588", "tm77588", "tm77588", "tm77588", "tm77588", "tm77…
## $ name      <chr> "Humphrey Bogart", "Ingrid Bergman", "Paul Henreid", "Claude…
## $ character <chr> "Rick Blaine", "Ilsa Lund", "Victor Laszlo", "Captain Louis …
## $ role      <chr> "ACTOR", "ACTOR", "ACTOR", "ACTOR", "ACTOR", "ACTOR", "ACTOR…
glimpse(titles)
## Rows: 3,030
## Columns: 15
## $ id                   <chr> "tm77588", "tm155702", "tm83648", "tm3175", "ts22…
## $ title                <chr> "Casablanca", "The Wizard of Oz", "Citizen Kane",…
## $ type                 <chr> "MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVI…
## $ description          <chr> "In Casablanca, Morocco in December 1941, a cynic…
## $ release_year         <int> 1943, 1939, 1941, 1945, 1940, 1940, 1946, 1934, 1…
## $ age_certification    <chr> "PG", "G", "PG", "", "", "G", "", "", "", "PG-13"…
## $ runtime              <int> 102, 102, 119, 113, 8, 238, 114, 93, 111, 109, 12…
## $ genres               <chr> "['drama', 'romance', 'war']", "['fantasy', 'fami…
## $ production_countries <chr> "['US']", "['US']", "['US']", "['US']", "['US']",…
## $ seasons              <dbl> NA, NA, NA, NA, 16, NA, NA, NA, NA, NA, NA, NA, N…
## $ imdb_id              <chr> "tt0034583", "tt0032138", "tt0033467", "tt0037059…
## $ imdb_score           <dbl> 8.5, 8.1, 8.3, 7.5, 7.7, 8.2, 7.9, 7.9, 7.9, 8.3,…
## $ imdb_votes           <dbl> 577842, 406105, 446627, 25589, 859, 319463, 87289…
## $ tmdb_popularity      <dbl> 22.005, 56.631, 19.900, 8.311, 1.400, 27.535, 11.…
## $ tmdb_score           <dbl> 8.167, 7.583, 8.022, 7.000, 10.000, 8.000, 7.700,…
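
The glimpse above shows two kinds of gaps: NA values (for example in seasons) and empty strings (for example in age_certification). Here is a minimal sketch of how both could be counted per column. The `toy_titles` data frame is a made-up stand-in, not the real file.

```r
# Toy stand-in for titles.csv; the values are illustrative, not from the real file
toy_titles <- data.frame(
  title = c("Movie A", "Show B", "Movie C"),
  type = c("MOVIE", "SHOW", "MOVIE"),
  age_certification = c("PG", "", ""),
  seasons = c(NA, 2, NA),
  stringsAsFactors = FALSE
)

# Count NA values in each column
na_per_column <- colSums(is.na(toy_titles))

# Count empty strings in each column; read.csv() keeps blank text fields
# as "" rather than NA, so they need a separate check
empty_per_column <- sapply(toy_titles, function(col) sum(col == "", na.rm = TRUE))

na_per_column
empty_per_column
```

Running the same two checks on the real `titles` data frame would show which columns need care before filtering or plotting.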

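Whoops! Let’s make it a little more readable. We’re using the kable and head functions to show the first few rows of each data set in an organized table.

Here’s our credits.csv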

kable(head(credits),
      align = "c",
      caption = "Sample table of credits data",
      format = "html")
Sample table of credits data
person_id id name character role
14701 tm77588 Humphrey Bogart Rick Blaine ACTOR
14702 tm77588 Ingrid Bergman Ilsa Lund ACTOR
14703 tm77588 Paul Henreid Victor Laszlo ACTOR
14704 tm77588 Claude Rains Captain Louis Renault ACTOR
14705 tm77588 Conrad Veidt Major Heinrich Strasser ACTOR
14706 tm77588 Sydney Greenstreet Signor Ferrari ACTOR

And here’s our titles.csv. The description column is dropped first because its long text makes the table hard to read.

# Drop the free-text description column, then show the first rows
titles <- within(titles, rm(description))
kable(head(titles),
      align = "c",
      caption = "Sample table of titles data",
      format = "html")
Sample table of titles data
id title type release_year age_certification runtime genres production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167
tm155702 The Wizard of Oz MOVIE 1939 G 102 [‘fantasy’, ‘family’] [‘US’] NA tt0032138 8.1 406105 56.631 7.583
tm83648 Citizen Kane MOVIE 1941 PG 119 [‘drama’] [‘US’] NA tt0033467 8.3 446627 19.900 8.022
tm3175 Meet Me in St. Louis MOVIE 1945 113 [‘drama’, ‘family’, ‘romance’, ‘music’, ‘comedy’] [‘US’] NA tt0037059 7.5 25589 8.311 7.000
ts225761 Tom and Jerry SHOW 1940 8 [‘animation’, ‘comedy’, ‘family’, ‘action’] [‘US’] 16 tt6422744 7.7 859 1.400 10.000
tm156463 Gone with the Wind MOVIE 1940 G 238 [‘drama’, ‘romance’, ‘war’, ‘history’] [‘US’] NA tt0031381 8.2 319463 27.535 8.000

And what genres do we have across both movies and shows?
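
The code that builds `genre_counts` is not shown in this write-up, so here is a minimal sketch of one way such a table could be produced. It assumes the genres column stores each list as text like "['drama', 'war']", as in the glimpse earlier. `toy_titles` and `toy_genre_counts` are made-up names for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for the titles data; values are illustrative only
toy_titles <- data.frame(
  title = c("A", "B", "C"),
  type = c("MOVIE", "MOVIE", "SHOW"),
  age_certification = c("PG", "PG", "TV-MA"),
  genres = c("['drama', 'war']", "['drama']", "['comedy', 'drama']"),
  stringsAsFactors = FALSE
)

toy_genre_counts <- toy_titles %>%
  # Strip the brackets and quotes so only comma-separated genre names remain
  mutate(genres = str_remove_all(genres, "\\[|\\]|'")) %>%
  # Give each genre of a title its own row
  separate_rows(genres, sep = ",\\s*") %>%
  # Count titles per genre, type, and age certification, largest first
  count(genres, type, age_certification, name = "Count", sort = TRUE)

toy_genre_counts
```

A title with several genres is counted once per genre, so the counts add up to more than the number of titles.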

Let’s see it in a bar graph.

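Here’s the table of title counts by genre, type, and age certification, in descending order. A blank genre or age certification means that value is missing for those titles.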

kable(genre_counts, caption = "Number of movies and TV shows by genre")
Number of movies and TV shows by genre
genres type age_certification Count
documentation MOVIE 427
drama MOVIE 406
drama MOVIE R 379
comedy MOVIE 271
thriller MOVIE R 260
comedy MOVIE R 193
crime MOVIE R 186
drama MOVIE PG-13 175
action MOVIE PG-13 167
drama SHOW TV-MA 153
action MOVIE R 146
comedy SHOW TV-MA 135
european MOVIE 134
comedy MOVIE PG-13 129
romance MOVIE 126
comedy MOVIE PG 121
drama MOVIE PG 108
thriller MOVIE PG-13 103
horror MOVIE R 98
romance MOVIE PG-13 95
crime MOVIE 90
european MOVIE R 89
history MOVIE 89
romance MOVIE R 87
family MOVIE PG 85
crime SHOW TV-MA 76
scifi MOVIE PG-13 75
thriller MOVIE 71
fantasy MOVIE PG 69
action MOVIE PG 67
crime MOVIE PG-13 62
animation MOVIE PG 60
fantasy MOVIE PG-13 60
music MOVIE 59
action MOVIE 58
documentation SHOW TV-MA 58
scifi MOVIE R 58
scifi MOVIE PG 54
documentation MOVIE R 53
romance MOVIE PG 53
thriller SHOW TV-MA 53
comedy SHOW TV-14 51
documentation MOVIE PG 51
war MOVIE 50
drama SHOW TV-14 49
animation MOVIE PG-13 48
family MOVIE 48
fantasy MOVIE 47
sport MOVIE 47
fantasy MOVIE R 44
action SHOW TV-MA 42
documentation MOVIE PG-13 40
scifi SHOW TV-MA 40
european MOVIE PG-13 38
animation MOVIE 37
animation SHOW TV-MA 35
thriller MOVIE PG 35
animation SHOW TV-Y7 34
comedy MOVIE G 33
history MOVIE R 33
family MOVIE G 32
family SHOW TV-Y7 32
horror MOVIE 32
music MOVIE R 32
european MOVIE PG 31
comedy SHOW TV-PG 30
drama SHOW 29
fantasy SHOW TV-MA 29
history MOVIE PG-13 28
war MOVIE R 28
comedy SHOW TV-Y7 25
scifi MOVIE 25
drama MOVIE G 24
animation MOVIE G 23
crime SHOW TV-14 23
documentation SHOW TV-14 23
family SHOW TV-PG 23
romance SHOW TV-MA 23
action SHOW TV-14 22
action SHOW TV-PG 22
animation SHOW TV-PG 22
history SHOW TV-MA 22
reality SHOW 22
scifi SHOW TV-14 22
crime MOVIE PG 20
horror SHOW TV-MA 20
music MOVIE PG 20
scifi SHOW TV-PG 20
scifi SHOW TV-Y7 20
sport MOVIE PG-13 20
MOVIE 19
comedy SHOW 19
documentation SHOW 19
music MOVIE PG-13 19
animation SHOW TV-14 18
action SHOW TV-Y7 17
animation MOVIE R 17
documentation MOVIE G 17
drama SHOW TV-PG 17
fantasy MOVIE G 17
fantasy SHOW TV-14 17
horror MOVIE PG-13 17
romance SHOW TV-14 17
reality SHOW TV-14 16
romance MOVIE G 16
sport MOVIE R 16
war MOVIE PG 16
fantasy SHOW TV-PG 15
thriller SHOW TV-14 15
western MOVIE R 15
family MOVIE PG-13 14
fantasy SHOW TV-Y7 14
reality SHOW TV-MA 13
sport MOVIE PG 13
western MOVIE 13
action MOVIE G 12
war MOVIE PG-13 12
history MOVIE PG 11
music MOVIE G 11
scifi MOVIE G 11
comedy SHOW TV-G 10
documentation SHOW TV-PG 10
european SHOW TV-MA 10
family SHOW TV-G 10
horror MOVIE PG 10
reality SHOW TV-G 10
sport SHOW TV-MA 10
war SHOW TV-MA 10
western MOVIE PG-13 10
thriller SHOW 9
thriller SHOW TV-PG 9
animation SHOW TV-Y 8
european MOVIE G 8
music SHOW TV-MA 8
sport SHOW TV-14 8
animation SHOW 7
action SHOW 6
animation SHOW TV-G 6
crime SHOW 6
family SHOW TV-Y 6
romance SHOW TV-PG 6
thriller SHOW TV-Y7 6
SHOW 5
crime SHOW TV-PG 5
crime SHOW TV-Y7 5
drama MOVIE NC-17 5
family SHOW 5
family SHOW TV-MA 5
horror SHOW TV-14 5
western MOVIE PG 5
comedy SHOW TV-Y 4
drama SHOW TV-G 4
drama SHOW TV-Y7 4
fantasy SHOW TV-Y 4
history MOVIE G 4
reality SHOW TV-PG 4
romance SHOW 4
SHOW TV-MA 3
comedy MOVIE NC-17 3
documentation SHOW TV-G 3
european SHOW 3
family MOVIE R 3
family SHOW TV-14 3
history SHOW TV-14 3
horror MOVIE G 3
horror SHOW 3
horror SHOW TV-Y7 3
romance MOVIE NC-17 3
scifi SHOW TV-G 3
sport SHOW TV-PG 3
thriller MOVIE G 3
war SHOW 3
SHOW TV-14 2
SHOW TV-G 2
SHOW TV-Y 2
crime SHOW TV-G 2
drama SHOW TV-Y7-FV 2
european MOVIE NC-17 2
european SHOW TV-PG 2
family SHOW TV-Y7-FV 2
fantasy SHOW 2
history SHOW TV-PG 2
horror SHOW TV-PG 2
music SHOW 2
music SHOW TV-14 2
scifi SHOW TV-Y7-FV 2
war MOVIE G 2
war SHOW TV-14 2
western MOVIE G 2
MOVIE PG 1
MOVIE R 1
action SHOW TV-Y 1
action SHOW TV-Y7-FV 1
animation SHOW TV-Y7-FV 1
crime MOVIE G 1
crime MOVIE NC-17 1
crime SHOW TV-Y7-FV 1
documentation MOVIE NC-17 1
documentation SHOW TV-Y7 1
drama SHOW TV-Y 1
european SHOW TV-Y7-FV 1
fantasy SHOW TV-G 1
history MOVIE NC-17 1
history SHOW 1
horror MOVIE NC-17 1
horror SHOW TV-G 1
horror SHOW TV-Y 1
music SHOW TV-PG 1
music SHOW TV-Y 1
music SHOW TV-Y7 1
reality MOVIE 1
reality MOVIE PG 1
reality MOVIE PG-13 1
reality SHOW TV-Y7 1
romance SHOW TV-G 1
romance SHOW TV-Y7 1
scifi SHOW TV-Y 1
sport MOVIE NC-17 1
sport SHOW 1
sport SHOW TV-G 1
thriller MOVIE NC-17 1
thriller SHOW TV-G 1
thriller SHOW TV-Y 1
western SHOW TV-14 1
western SHOW TV-MA 1

Now that we have looked at the number of titles per genre in our dataset,

let us see if there’s a relationship between age_certification and genres.

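## Removing missing values
## Caution: na.omit() drops every row that has an NA in any column. seasons is NA
## for movies, so this also removes every movie from titles.
titles = na.omit(titles)


## Cleaning our age certification column 
#titles$age_certification = gsub( " .*", "", titles$age_certification)

## Total number of titles for each age certification and genre
## (sum the existing counts; n() would only count the rows of genre_counts)
age_genre = genre_counts %>% 
  group_by(age_certification, genres) %>% 
  summarize(age_genre_count = sum(Count), .groups = "drop")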

## Plotting
ggplot(age_genre,aes(x=age_certification , y=age_genre_count, fill = genres)) + geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Age Certification", y = "Number of Titles", title = "Age Certification and Genres") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE))

unique(titles$age_certification)
## [1] ""         "TV-G"     "TV-Y"     "TV-Y7"    "TV-PG"    "TV-14"    "TV-MA"   
## [8] "TV-Y7-FV"
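
The output above lists only TV ratings, which is what you would expect if `na.omit()` removed every movie. Here is a minimal sketch of that effect on made-up data, next to `tidyr::drop_na()` restricted to one column as a gentler alternative. The `toy_titles` values are illustrative only.

```r
library(tidyr)

# Toy stand-in for the titles data: movies have no seasons value
toy_titles <- data.frame(
  title = c("Movie A", "Movie B", "Show C"),
  type = c("MOVIE", "MOVIE", "SHOW"),
  release_year = c(1943, NA, 2012),
  seasons = c(NA, NA, 1),
  stringsAsFactors = FALSE
)

# na.omit() drops a row if ANY column is NA, so both movies disappear
after_na_omit <- na.omit(toy_titles)

# drop_na() with a named column only checks that column,
# so Movie A survives and only the row without a release year is dropped
after_drop_na <- drop_na(toy_titles, release_year)

table(after_na_omit$type)
table(after_drop_na$type)
```

Which columns to check depends on the question being asked; for a release-year analysis, only release_year needs to be complete.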

Here is the number of titles available on HBO by release year.
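
The plotting code is not shown here, so this is only a sketch of how a chart of titles per release year could be drawn, using made-up data. `toy_titles`, `titles_per_year`, and `timeline_plot` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the titles data; values are illustrative only
toy_titles <- data.frame(
  type = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"),
  release_year = c(1999, 1999, 2001, 1999, 2001, 2001)
)

# One row per release year and type, with the number of titles in each
titles_per_year <- toy_titles %>%
  count(release_year, type, name = "n_titles")

# One line per type, showing how many titles come from each release year
timeline_plot <- ggplot(titles_per_year,
                        aes(x = release_year, y = n_titles, colour = type)) +
  geom_line() +
  geom_point() +
  labs(x = "Release year", y = "Number of titles",
       title = "Titles by release year")

timeline_plot
```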

You can see there is a wide range of movies and TV shows, especially in the years they were released. I wonder what the newest and oldest titles are?

## [1] title        type         release_year genres      
## <0 rows> (or 0-length row.names)
## [1] title        type         release_year genres      
## <0 rows> (or 0-length row.names)
##          title type release_year                                        genres
## 1 Looney Tunes SHOW         1929 ['comedy', 'family', 'thriller', 'animation']
##            title type release_year
## 1 The Last of Us SHOW         2023
##                                               genres
## 1 ['drama', 'action', 'horror', 'scifi', 'thriller']

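The two empty results are presumably the movie queries: no movie rows are left after the na.omit(titles) call earlier, so only shows come back. I definitely have not seen either of those shows, but everyone should know The Last of Us because of TikTok.

Now I am wondering: what are the shortest and longest movies and shows?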

shortest_movie <- titles %>%
  filter(type == "MOVIE") %>%
  arrange(runtime) %>%
  select(title, type,runtime, release_year, genres) %>%
  head(1)

shortest_movie
## [1] title        type         runtime      release_year genres      
## <0 rows> (or 0-length row.names)
longest_movie <- titles %>%
  filter(type == "MOVIE") %>%
  arrange(desc(runtime)) %>%
  select(title, type, runtime, release_year, genres) %>%
  head(1)

longest_movie
## [1] title        type         runtime      release_year genres      
## <0 rows> (or 0-length row.names)
shortest_show <- titles %>%
  filter(type == "SHOW") %>%
  arrange(runtime, seasons) %>%
  select(title, type, runtime, seasons, release_year, genres) %>%
  head(1)

shortest_show
##                 title type runtime seasons release_year     genres
## 1 Garfunkel and Oates SHOW       4       1         2012 ['comedy']
# "Longest" here means the most seasons, with runtime breaking ties
longest_show <- titles %>%
  filter(type == "SHOW") %>%
  arrange(desc(seasons), desc(runtime)) %>%
  select(title, type, runtime, seasons, release_year, genres) %>%
  head(1)

longest_show
##           title type runtime seasons release_year
## 1 Sesame Street SHOW      51      53         1969
##                                                  genres
## 1 ['comedy', 'animation', 'family', 'fantasy', 'music']


What if we try to combine these data sets?

both_data <- inner_join(titles, credits, by = "id")

kable(head(both_data),
      align = "c",
      caption = "Sample table of both data",
      format = "html")
Sample table of both data
id title type release_year age_certification runtime genres production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score person_id name character role
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14701 Humphrey Bogart Rick Blaine ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14702 Ingrid Bergman Ilsa Lund ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14703 Paul Henreid Victor Laszlo ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14704 Claude Rains Captain Louis Renault ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14705 Conrad Veidt Major Heinrich Strasser ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14706 Sydney Greenstreet Signor Ferrari ACTOR
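
An inner join keeps only ids that appear in both tables, so it is worth checking what gets left out. Here is a minimal sketch with made-up data, using `anti_join()` to list the rows without a match. The toy ids and names are illustrative only.

```r
library(dplyr)

# Toy stand-ins for the two data sets
toy_titles <- data.frame(
  id = c("tm1", "tm2", "ts3"),
  title = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
toy_credits <- data.frame(
  id = c("tm1", "tm1", "ts3", "tm9"),
  name = c("P", "Q", "R", "S"),
  stringsAsFactors = FALSE
)

# One row per (title, credit) pair whose id exists in both tables
toy_both <- inner_join(toy_titles, toy_credits, by = "id")

# Titles that have no credits at all, and credits whose title is unknown
titles_without_credits <- anti_join(toy_titles, toy_credits, by = "id")
credits_without_title <- anti_join(toy_credits, toy_titles, by = "id")

nrow(toy_both)
titles_without_credits
credits_without_title
```

Note that the joined table repeats every title column once per credited person, which is why Casablanca fills all six rows of the sample above.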

Let’s see how many movies and TV shows we are dealing with

titles %>% 
  count(type)
##    type    n
## 1 MOVIE 2408
## 2  SHOW  622

Wow! That’s a lot more movies than shows! But let’s see it visually

# Create a data frame with counts of movies and shows
title_counts = data.frame(
  type = c("MOVIE", "SHOW"),
  Count = c(sum(titles$type == "MOVIE"), sum(titles$type == "SHOW"))
)

# Create the bar chart
ggplot(title_counts, aes(x = type, y = Count, fill = type)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of Movies and Shows") +
  xlab("") +
  ylab("Count")
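
As an alternative sketch, `geom_bar()` can count the rows itself, so the separate counts data frame is optional. The example below uses made-up data; `toy_titles` and `type_plot` are hypothetical names.

```r
library(ggplot2)

# Toy stand-in for the titles data: three movies and one show
toy_titles <- data.frame(
  type = c("MOVIE", "MOVIE", "MOVIE", "SHOW"),
  stringsAsFactors = FALSE
)

# With no y aesthetic, geom_bar() counts the rows in each type
type_plot <- ggplot(toy_titles, aes(x = type, fill = type)) +
  geom_bar() +
  labs(title = "Number of Movies and Shows", x = NULL, y = "Count")

# layer_data() exposes the values ggplot2 computed for the bars
bar_heights <- layer_data(type_plot)$count

type_plot
```

Both approaches draw the same bars; precomputing the counts is mostly useful when the numbers are also needed elsewhere.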

What range of release years does our data cover?

range(titles$release_year)
## [1] 1901 2023

Out of curiosity, which titles were released in those two years?

# Keep every title released in the earliest year in the data (1901)
oldest_title <- titles %>%
  filter(release_year == min(release_year))

oldest_title
##        id                   title  type release_year age_certification runtime
## 1 tm54582 The Prince of Magicians MOVIE         1901                         2
##       genres production_countries seasons imdb_id imdb_score imdb_votes
## 1 ['comedy']               ['FR']      NA                 NA         NA
##   tmdb_popularity tmdb_score
## 1           1.747          6
# Keep every title released in the latest year in the data (2023)
newest_title <- titles %>%
  filter(release_year == max(release_year))

newest_title
##           id                                                    title  type
## 1   ts226904                                           The Last of Us  SHOW
## 2   ts283518                                                    Velma  SHOW
## 3   ts375134                                                Rain Dogs  SHOW
## 4   ts374449                                                The Climb  SHOW
## 5  tm1301628                           Marc Maron: From Bleak to Dark MOVIE
## 6  tm1040094                                              House Party MOVIE
## 7  tm1310730                              Marlon Wayans: God Loves Me MOVIE
## 8   ts171230                                               Poor Devil  SHOW
## 9  tm1015760                                Chernobyl: The Lost Tapes MOVIE
## 10 tm1306271                         The Weeknd: Live at SoFi Stadium MOVIE
## 11 tm1306569                      Chasing Greatness: Coach K x LeBron MOVIE
## 12 tm1305288                       Marcella Arguello: Bitch, Grow Up! MOVIE
## 13 tm1303655                                 Super-Vilains: l'Enquête MOVIE
## 14 tm1296261 Just a Boy From Tupelo: Bringing Elvis to the Big Screen MOVIE
## 15 tm1065897                       Dionne Warwick: Don't Make Me Over MOVIE
## 16 tm1304306                                       The Family Meeting MOVIE
##    release_year age_certification runtime
## 1          2023             TV-MA      60
## 2          2023             TV-MA      25
## 3          2023                        27
## 4          2023                        44
## 5          2023                        65
## 6          2023                 R     100
## 7          2023                        60
## 8          2023             TV-MA      22
## 9          2023                        96
## 10         2023                 R      98
## 11         2023                        30
## 12         2023                 R      37
## 13         2023             PG-13      62
## 14         2023             PG-13      27
## 15         2023                PG      95
## 16         2023                        15
##                                                genres production_countries
## 1  ['drama', 'action', 'horror', 'scifi', 'thriller']               ['US']
## 2                    ['comedy', 'crime', 'animation']               ['US']
## 3                                 ['drama', 'comedy']         ['GB', 'US']
## 4                                         ['reality']               ['US']
## 5                         ['comedy', 'documentation']               ['US']
## 6                                          ['comedy']               ['US']
## 7                                          ['comedy']               ['US']
## 8                             ['comedy', 'animation']               ['ES']
## 9                        ['documentation', 'history']               ['GB']
## 10                                          ['music']               ['US']
## 11                                                 []               ['BR']
## 12                                         ['comedy']               ['US']
## 13                                  ['documentation']               ['FR']
## 14                                  ['documentation']                   []
## 15                         ['documentation', 'music']         ['US', 'GB']
## 16                                                 []                   []
##    seasons    imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score
## 1        1 tt11915056        9.1     255529        3481.253      8.798
## 2        1 tt14153790        1.5      70034         130.974      3.399
## 3        1 tt19050000         NA         NA           0.600         NA
## 4        1 tt15082926        6.8        376          26.332         NA
## 5       NA tt26453369        7.1        787           6.638      5.000
## 6       NA  tt8005118        4.4       2360         103.564      6.500
## 7       NA tt26753138        6.3        204          15.338      6.700
## 8        1 tt15764846        6.6        265          13.511      7.714
## 9       NA tt13913326        7.9       1424              NA         NA
## 10      NA tt26685153        8.1        257          23.370      5.800
## 11      NA                    NA         NA           3.974         NA
## 12      NA tt26623699        6.9         27           7.509      2.000
## 13      NA tt26498712        5.5         45           3.402      6.000
## 14      NA                    NA         NA           2.605      4.500
## 15      NA  tt6170406        7.8        255           9.371         NA
## 16      NA                    NA         NA           3.091      2.000

Now let’s see the top 10 highest-rated movies and shows by IMDb score

top_10_movies <- titles %>% 
  filter(type == "MOVIE") %>%
  arrange(desc(imdb_score)) %>%
  select(title, type, release_year, genres) %>%
  head(10)
top_10_movies
##                                                title  type release_year
## 1                           The Shawshank Redemption MOVIE         1994
## 2                                    Celebrity Habla MOVIE         2009
## 3                                  Emergency Contact MOVIE         2015
## 4                                    The Dark Knight MOVIE         2008
## 5      The Lord of the Rings: The Return of the King MOVIE         2003
## 6                Euphoria: Trouble Don't Last Always MOVIE         2020
## 7        Juan Luis Guerra 4.40: Entre Mar y Palmeras MOVIE         2021
## 8  The Lord of the Rings: The Fellowship of the Ring MOVIE         2001
## 9              The Lord of the Rings: The Two Towers MOVIE         2002
## 10                                 Celebrity Habla 2 MOVIE         2010
##                                      genres
## 1                                 ['drama']
## 2                         ['documentation']
## 3                                ['comedy']
## 4  ['drama', 'thriller', 'action', 'crime']
## 5            ['fantasy', 'action', 'drama']
## 6                                 ['drama']
## 7                                 ['music']
## 8            ['fantasy', 'action', 'drama']
## 9            ['action', 'fantasy', 'drama']
## 10                        ['documentation']
top_10_shows <- titles %>% 
  filter(type == "SHOW") %>%
  arrange(desc(imdb_score)) %>%
  select(title, type, release_year, genres) %>%
  head(10)



top_10_shows
##                          title type release_year
## 1             Band of Brothers SHOW         2001
## 2                    Chernobyl SHOW         2019
## 3                     The Wire SHOW         2002
## 4            Eyes on the Prize SHOW         1987
## 5                 The Sopranos SHOW         1999
## 6              Game of Thrones SHOW         2011
## 7               Rick and Morty SHOW         2013
## 8                    Homegrown SHOW         2021
## 9               The Last of Us SHOW         2023
## 10 Batman: The Animated Series SHOW         1992
##                                                          genres
## 1                         ['drama', 'war', 'history', 'action']
## 2                              ['drama', 'thriller', 'history']
## 3                                ['drama', 'crime', 'thriller']
## 4                                  ['documentation', 'history']
## 5                                            ['drama', 'crime']
## 6            ['scifi', 'drama', 'action', 'romance', 'fantasy']
## 7                    ['animation', 'scifi', 'action', 'comedy']
## 8                                    ['documentation', 'drama']
## 9            ['drama', 'action', 'horror', 'scifi', 'thriller']
## 10 ['family', 'scifi', 'animation', 'action', 'crime', 'drama']
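
Sorting by score alone treats a title with a handful of votes the same as one with hundreds of thousands. The rankings above do not show imdb_votes, so whether that matters here is not visible. Here is a minimal sketch of adding a vote threshold, with made-up data and a hypothetical cutoff `min_votes`.

```r
library(dplyr)

# Toy stand-in for the titles data; scores and votes are illustrative only
toy_titles <- data.frame(
  title = c("Widely Rated", "Barely Rated", "Solid Hit", "Unrated"),
  type = "MOVIE",
  imdb_score = c(8.5, 9.4, 8.0, NA),
  imdb_votes = c(500000, 12, 40000, NA),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff; the right value is a judgment call
min_votes <- 1000

top_rated <- toy_titles %>%
  # filter() also drops rows where imdb_votes is NA
  filter(type == "MOVIE", imdb_votes >= min_votes) %>%
  arrange(desc(imdb_score)) %>%
  select(title, imdb_score, imdb_votes) %>%
  head(10)

top_rated
```

Keeping imdb_score and imdb_votes in the selected columns also makes the ranking easier to sanity-check.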

Now let’s look at the credits data.

credits %>%
  count(role)
##       role     n
## 1    ACTOR 62158
## 2 DIRECTOR  2721

Are any of these actors/directors in multiple projects? If so, who was in the most projects?

project_count <- credits %>%
  count(name)

glimpse(project_count)
## Rows: 45,276
## Columns: 2
## $ name <chr> " Amanda Phillips", "'Auntie' Mackay", "'Little Man' Machan", "'W…
## $ n    <int> 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,…
most_projects <- credits %>% 
  count(name) %>% 
  slice_max(n)

most_projects
##           name  n
## 1 Grey DeLisle 60
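
Counting by name alone counts credit rows, so a person credited twice on one title is counted twice, and two people who share a name are merged. The glimpse above also shows at least one name with a leading space. Here is a minimal sketch that trims names and counts distinct titles per person_id, on made-up data. Whether it would change the result for the real data is not something this write-up shows.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the credits data; values are illustrative only
toy_credits <- data.frame(
  person_id = c(1, 1, 1, 2, 3),
  id = c("tm1", "tm1", "ts2", "tm1", "ts2"),
  name = c("Ann Lee", "Ann Lee", "Ann Lee", " Bo Kim", "Ann Lee"),
  role = c("ACTOR", "DIRECTOR", "ACTOR", "ACTOR", "ACTOR"),
  stringsAsFactors = FALSE
)

toy_project_count <- toy_credits %>%
  # Remove stray leading and trailing spaces from names
  mutate(name = str_trim(name)) %>%
  # Keep one row per person and title, even if they have several roles on it
  distinct(person_id, name, id) %>%
  # Count titles per person; person_id keeps namesakes apart
  count(person_id, name, name = "n_titles", sort = TRUE)

toy_project_count
```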
## Who is this person?

credits %>% 
  filter(name == "Grey DeLisle")
##    person_id        id         name
## 1      14142   ts21507 Grey DeLisle
## 2      14142   ts21601 Grey DeLisle
## 3      14142   ts20381 Grey DeLisle
## 4      14142   ts22480 Grey DeLisle
## 5      14142    ts5042 Grey DeLisle
## 6      14142   tm94209 Grey DeLisle
## 7      14142   tm93784 Grey DeLisle
## 8      14142   tm58727 Grey DeLisle
## 9      14142  tm656353 Grey DeLisle
## 10     14142   tm43061 Grey DeLisle
## 11     14142   tm23583 Grey DeLisle
## 12     14142  tm140798 Grey DeLisle
## 13     14142   ts20574 Grey DeLisle
## 14     14142  tm167763 Grey DeLisle
## 15     14142   tm62126 Grey DeLisle
## 16     14142   tm30342 Grey DeLisle
## 17     14142  tm167388 Grey DeLisle
## 18     14142   tm65596 Grey DeLisle
## 19     14142  tm160240 Grey DeLisle
## 20     14142  tm160231 Grey DeLisle
## 21     14142  tm151689 Grey DeLisle
## 22     14142  tm179476 Grey DeLisle
## 23     14142  tm177932 Grey DeLisle
## 24     14142    ts3139 Grey DeLisle
## 25     14142  tm159051 Grey DeLisle
## 26     14142  tm171655 Grey DeLisle
## 27     14142   tm63619 Grey DeLisle
## 28     14142  tm152510 Grey DeLisle
## 29     14142  tm152501 Grey DeLisle
## 30     14142   ts37956 Grey DeLisle
## 31     14142  tm195247 Grey DeLisle
## 32     14142  tm244555 Grey DeLisle
## 33     14142  tm238389 Grey DeLisle
## 34     14142  tm193859 Grey DeLisle
## 35     14142  tm244564 Grey DeLisle
## 36     14142  tm219341 Grey DeLisle
## 37     14142  tm214711 Grey DeLisle
## 38     14142  tm244119 Grey DeLisle
## 39     14142  tm244479 Grey DeLisle
## 40     14142  tm138025 Grey DeLisle
## 41     14142  tm365731 Grey DeLisle
## 42     14142  tm422417 Grey DeLisle
## 43     14142  tm301058 Grey DeLisle
## 44     14142  tm372838 Grey DeLisle
## 45     14142  tm361837 Grey DeLisle
## 46     14142  tm326858 Grey DeLisle
## 47     14142  tm414009 Grey DeLisle
## 48     14142  tm405461 Grey DeLisle
## 49     14142  tm317933 Grey DeLisle
## 50     14142  tm423754 Grey DeLisle
## 51     14142  tm820756 Grey DeLisle
## 52     14142   ts89867 Grey DeLisle
## 53     14142  tm894108 Grey DeLisle
## 54     14142  tm883958 Grey DeLisle
## 55     14142 tm1248448 Grey DeLisle
## 56     14142 tm1028015 Grey DeLisle
## 57     14142 tm1065433 Grey DeLisle
## 58     14142  tm930306 Grey DeLisle
## 59     14142 tm1171238 Grey DeLisle
## 60     14142  tm987899 Grey DeLisle
##                                                                           character
## 1                                                              Daphne Blake (voice)
## 2                                                        The High Priestess (voice)
## 3                                                              Daphne Blake (voice)
## 4                                                                     Mandy (voice)
## 5                                                  Frances 'Frankie' Foster (voice)
## 6                                                                    Daphne (voice)
## 7                                                              Daphne Blake (voice)
## 8                                             Daphne / Cat Witch / Honeybee (voice)
## 9  Frankie Foster / Tiny Friend / Little Boy Voice / Lady (voice) (as Grey DeLisle)
## 10                                        Crazy Old Cat Lady/Gramma Stuffum (voice)
## 11                                                                   Daphne (voice)
## 12                                                                   Daphne (voice)
## 13                                                                                 
## 14                                                           Barbara Gordon (voice)
## 15                                             Anchor Carla / Female Mutant (voice)
## 16                                                        Lois Lane / Queen (voice)
## 17                                       Ree'Yu / Ardakian Trawl / Boodikka (voice)
## 18                                                         Young Manchester (voice)
## 19                                                             Daphne Blake (voice)
## 20                                                             Daphne Blake (voice)
## 21                                            Grandmother (voice) (as Grey Griffin)
## 22                    Nora Allen / Young Barry Allen / Martha Wayne / Joker (voice)
## 23                                                             Anchor Carla (voice)
## 24                                                Margaret Sorrow / Magpie  (voice)
## 25                       Wonder Woman / Superbaby (voice) (as Grey DeLisle Griffin)
## 26                                                             Daphne Blake (voice)
## 27                                                             Daphne Blake (voice)
## 28                                                             Daphne Blake (voice)
## 29                                                             Daphne Blake (voice)
## 30                                                             Daphne Blake (voice)
## 31                                                          Tina / Platinum (voice)
## 32                                                             Daphne Blake (voice)
## 33                                                             Wonder Woman (voice)
## 34                                                                 Samantha (voice)
## 35                                                             Wonder Woman (voice)
## 36                                                 Wonder Woman / Lois Lane (voice)
## 37                                                             Daphne Blake (voice)
## 38                                                             Daphne Blake (voice)
## 39                                                             Wonder Woman (voice)
## 40                                                             Daphne Blake (voice)
## 41                                Sister Leslie / Jason / Additional Voices (voice)
## 42                                                             Daphne Blake (voice)
## 43                                                             Daphne Blake (voice)
## 44                        Wonder Woman / Diana Prince (voice) and Lois Lane (voice)
## 45                                              Daphne Blake / Black Canary (voice)
## 46                                                             Daphne Blake (voice)
## 47           Diana Prince / Wonder Woman (voice) / Lois Lane (voice) / Ring (voice)
## 48                                                         Wonder Woman / Lois Lane
## 49                                                  Wonder Woman / Platinum (voice)
## 50                                                             Wonder Woman (voice)
## 51                                                               Mrs. Claus (Voice)
## 52                                                             Daphne Blake (voice)
## 53                                                        Additional Voices (voice)
## 54                                         Wonder Woman (voice) / Lois Lane (voice)
## 55                                     Daphne / Daisy / Musketeer 1 / Olive (voice)
## 56                                   Beelzebub / Little Della / Little Jack (voice)
## 57                                         Daphne Blake / Frau Glockenspiel (voice)
## 58                                                                 Lady Eve (voice)
## 59                                              Diana Prince / Wonder Woman (voice)
## 60                                                             Daphne Blake (voice)
##     role
## 1  ACTOR
## 2  ACTOR
## 3  ACTOR
## 4  ACTOR
## 5  ACTOR
## 6  ACTOR
## 7  ACTOR
## 8  ACTOR
## 9  ACTOR
## 10 ACTOR
## 11 ACTOR
## 12 ACTOR
## 13 ACTOR
## 14 ACTOR
## 15 ACTOR
## 16 ACTOR
## 17 ACTOR
## 18 ACTOR
## 19 ACTOR
## 20 ACTOR
## 21 ACTOR
## 22 ACTOR
## 23 ACTOR
## 24 ACTOR
## 25 ACTOR
## 26 ACTOR
## 27 ACTOR
## 28 ACTOR
## 29 ACTOR
## 30 ACTOR
## 31 ACTOR
## 32 ACTOR
## 33 ACTOR
## 34 ACTOR
## 35 ACTOR
## 36 ACTOR
## 37 ACTOR
## 38 ACTOR
## 39 ACTOR
## 40 ACTOR
## 41 ACTOR
## 42 ACTOR
## 43 ACTOR
## 44 ACTOR
## 45 ACTOR
## 46 ACTOR
## 47 ACTOR
## 48 ACTOR
## 49 ACTOR
## 50 ACTOR
## 51 ACTOR
## 52 ACTOR
## 53 ACTOR
## 54 ACTOR
## 55 ACTOR
## 56 ACTOR
## 57 ACTOR
## 58 ACTOR
## 59 ACTOR
## 60 ACTOR

And what is the distribution of genres across both movies and shows?

#MOVIE_genre_data= titles %>% 
#  separate_rows(genres, sep = ", ") %>% 
#  group_by(type = "MOVIE",genres) %>% 
#  summarize(Count = n()) %>% 
#  ungroup()

# Count titles per genre and type
genre_counts <- titles %>%
  # genres is stored as a string such as "['drama', 'comedy']": strip quotes and brackets
  mutate(genres = str_remove_all(genres, "[\\[\\]']")) %>%
  # one row per (title, genre) pair
  separate_rows(genres, sep = ", ") %>%
  group_by(genres, type) %>%
  summarize(Count = n(), .groups = "drop") %>%
  arrange(desc(Count))

# Create the bar chart
ggplot(genre_counts, aes(x = reorder(genres, Count), y = Count, fill = type)) +
  geom_bar(stat = "identity")  +
  labs(x = "Genre", y = "Count", title = "Distribution of Genres") +
  theme_minimal()
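
To make the genre-parsing step easier to follow, here is a minimal sketch on a toy data frame. `toy_titles` and `toy_counts` are made-up names for illustration only, and the toy rows assume the genres column looks like `"['drama', 'comedy']"`, as the cleaning code above suggests. The extra `filter()` is our own addition: it drops the blank genre that a title with an empty list (`"[]"`) could otherwise leave behind.

```r
library(tidyverse)

# Toy stand-in for `titles` (hypothetical data, not the Kaggle file)
toy_titles <- tibble(
  title  = c("A", "B", "C"),
  type   = c("SHOW", "MOVIE", "MOVIE"),
  genres = c("['drama', 'comedy']", "['drama']", "[]")
)

toy_counts <- toy_titles %>%
  # strip single quotes and square brackets in one pass
  mutate(genres = str_remove_all(genres, "[\\[\\]']")) %>%
  # one row per (title, genre) pair
  separate_rows(genres, sep = ", ") %>%
  # drop any blank genre left over from a title whose list was "[]"
  filter(genres != "") %>%
  # count() is shorthand for group_by() + summarize(n()) + ungroup()
  count(genres, type, name = "Count", sort = TRUE)

toy_counts
```

Title "A" contributes one row to drama and one to comedy, so a title with several genres is counted once per genre.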

The genre labels on the x-axis are hard to read, so let’s flip the coordinates

# genre_counts <- titles %>%
#   separate_rows(genres, sep = ", ") %>%
#   group_by(genres) %>%
#   summarize(Count = n()) %>%
#   ungroup() %>%   # ungroup() removes any grouping structure left over from group_by(), so that later steps such as arrange() work on the data frame as a whole rather than within each genre group
#   arrange(desc(Count))

# Create the bar chart
#ggplot(genre_counts, aes(x = reorder(genres, Count), y = Count)) +
#  geom_bar(stat = "identity", fill = "purple")  +
#  coord_flip() +
#  labs(x = "Genre", y = "Count", title = "Distribution of Genres", ) +
#  theme_minimal()

# Same genre counts as before
genre_counts <- titles %>%
  # strip quotes and square brackets from the genres string
  mutate(genres = str_remove_all(genres, "[\\[\\]']")) %>%
  separate_rows(genres, sep = ", ") %>%
  group_by(genres, type) %>%
  summarize(Count = n(), .groups = "drop") %>%
  arrange(desc(Count))

# Create the bar chart, flipped so the genre names sit on the vertical axis
ggplot(genre_counts, aes(x = reorder(genres, Count), y = Count, fill = type)) +
  geom_bar(stat = "identity") +
  labs(x = "Genre", y = "Count", title = "Distribution of Genres") +
  theme_minimal() +
  coord_flip()

Here is the number of movies and shows available on HBO by release year

# titles$release_year <- as.Date(paste0(titles$release_year, "-01-01"))  
# convert release_year to date format

# titles$release_date <- as.Date(paste0("01-01-", titles$release_year), format = "%d-%m-%Y")

# create type column

#titles$type <- ifelse(titles$type == "SHOW", "MOVIE", "no")

# count number of titles by year and type
title_counts <- titles %>%
  group_by(release_year, type) %>%
  summarize(count = n())

# plot number of titles by year and type
ggplot(titles , aes(x = release_year, fill = type)) +
  geom_bar() +
  labs(x = "Release Year", y = "Number of Titles", title = "Number of Shows and Movies Available by Year") +
  scale_fill_manual(values = c("SHOW" = "purple", "MOVIE" = "darkgrey")) +
  theme(plot.title = element_text(hjust = 0.5)) 
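
The chart above lets `geom_bar()` do the counting from `titles`, so the `title_counts` table is not actually used. As a sketch of the alternative, the counts can be computed first and then plotted directly with `geom_col()`. The toy data and the names `toy_titles`, `toy_title_counts`, and `p` are ours, made up for illustration.

```r
library(tidyverse)

# Toy stand-in for `titles` with only the two columns used here (hypothetical data)
toy_titles <- tibble(
  release_year = c(2019, 2019, 2020, 2020, 2020),
  type         = c("SHOW", "MOVIE", "SHOW", "SHOW", "MOVIE")
)

# count() returns an ungrouped table with one row per year and type
toy_title_counts <- toy_titles %>%
  count(release_year, type, name = "count")

# geom_col() uses the precomputed counts as bar heights
p <- ggplot(toy_title_counts, aes(x = release_year, y = count, fill = type)) +
  geom_col() +
  labs(x = "Release Year", y = "Number of Titles") +
  scale_fill_manual(values = c("SHOW" = "purple", "MOVIE" = "darkgrey"))

p
```

Both approaches should draw the same bars. Precomputing the counts is mainly useful when the table itself is also needed, for example to print it with kable.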

Let’s find out

Here are both of those graphs combined

genre_popularity <- titles %>%
  mutate(genres = str_remove_all(genres, "'")) %>% 
  mutate(genres = gsub("\\[", "", genres)) %>% 
  mutate(genres = gsub("\\]", "", genres)) %>% 
  separate_rows(genres, sep = ", ") %>%
  group_by(genres, type, tmdb_popularity, tmdb_score, imdb_score) %>%
  summarize(Count = n()) %>%
  ungroup() %>%
  arrange(desc(tmdb_popularity))

genre_popularity
## # A tibble: 1,523 × 6
##    genres   type  tmdb_popularity tmdb_score imdb_score Count
##    <chr>    <chr>           <dbl>      <dbl>      <dbl> <int>
##  1 action   SHOW            3481.       8.80        9.1     1
##  2 drama    SHOW            3481.       8.80        9.1     1
##  3 horror   SHOW            3481.       8.80        9.1     1
##  4 scifi    SHOW            3481.       8.80        9.1     1
##  5 thriller SHOW            3481.       8.80        9.1     1
##  6 action   SHOW             559.       8.4         9.2     1
##  7 drama    SHOW             559.       8.4         9.2     1
##  8 fantasy  SHOW             559.       8.4         9.2     1
##  9 romance  SHOW             559.       8.4         9.2     1
## 10 scifi    SHOW             559.       8.4         9.2     1
## # … with 1,513 more rows
# Create the bar chart
ggplot(genre_popularity, aes(x = reorder(genres, Count), y = tmdb_popularity, fill = type)) +
  geom_bar(stat = "identity") +
  labs(x = "Genre", y = "tmdb_popularity", title = "Genres and their popularity") +
  theme_light() +
  coord_flip()

Who would’ve known?!
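
One caveat about the chart above: each bar stacks the popularity of every title in a genre, so a genre with many titles can look popular simply because it is large. A possible alternative, sketched here on toy data, is to compare the average popularity per genre. The data and the names `toy_titles`, `toy_genre_popularity`, `mean_popularity`, and `n_titles` are made up for illustration.

```r
library(tidyverse)

# Toy stand-in for `titles` (hypothetical data)
toy_titles <- tibble(
  type            = c("SHOW", "SHOW", "MOVIE"),
  genres          = c("['action', 'drama']", "['drama']", "['action']"),
  tmdb_popularity = c(100, 20, 30)
)

toy_genre_popularity <- toy_titles %>%
  # same cleaning as before: strip quotes and brackets, then one row per genre
  mutate(genres = str_remove_all(genres, "[\\[\\]']")) %>%
  separate_rows(genres, sep = ", ") %>%
  filter(genres != "") %>%
  group_by(genres, type) %>%
  summarize(
    mean_popularity = mean(tmdb_popularity, na.rm = TRUE),  # average, not sum
    n_titles        = n(),                                  # group size, for context
    .groups         = "drop"
  ) %>%
  arrange(desc(mean_popularity))

toy_genre_popularity
```

In the toy data, drama shows have two titles with popularity 100 and 20, so their mean is 60 even though their stacked total would be 120.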

Let us look at the number of movies and TV shows by country

Unfortunately, because HBO’s movies and shows come from only 99 countries, some countries on the map are left uncolored
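
For readers wondering how a table like the one below could be put together, here is a rough sketch on toy data. It assumes that `production_countries` is stored the same way as `genres` (a string such as `"['US', 'GB']"`), which the original code does not show. `toy_titles` and `toy_country_counts` are made-up names. `countrycode()` comes from the countrycode package and translates ISO two-letter codes into country names.

```r
library(tidyverse)
library(countrycode)

# Toy stand-in for `titles` (hypothetical data; the column format is an assumption)
toy_titles <- tibble(
  type                 = c("SHOW", "SHOW", "MOVIE"),
  production_countries = c("['US']", "['US', 'GB']", "['GB']")
)

toy_country_counts <- toy_titles %>%
  # strip quotes and brackets, then one row per (title, country) pair
  mutate(production_countries = str_remove_all(production_countries, "[\\[\\]']")) %>%
  separate_rows(production_countries, sep = ", ") %>%
  filter(production_countries != "") %>%
  # number of titles per country and type
  count(production_countries, type, name = "total", sort = TRUE) %>%
  # look up the full country name from the ISO 3166-1 alpha-2 code
  mutate(full_country_name = countrycode(production_countries,
                                         origin = "iso2c",
                                         destination = "country.name"))

toy_country_counts
```

A co-production such as `"['US', 'GB']"` is counted once for each country, so the totals can add up to more than the number of titles.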

Number of movies and TV shows by country
production_countries type total full_country_name
US SHOW 426 United States
GB SHOW 32 United Kingdom
ES SHOW 18 Spain
BR SHOW 9 Brazil
CA SHOW 7 Canada
AR SHOW 6 Argentina
TW SHOW 6 Taiwan
DE SHOW 5 Germany
JP SHOW 5 Japan
MX SHOW 5 Mexico
FR SHOW 4 France
IT SHOW 4 Italy
CL SHOW 2 Chile
CZ SHOW 2 Czechia
ID SHOW 2 Indonesia
IL SHOW 2 Israel
PL SHOW 2 Poland
RO SHOW 2 Romania
SG SHOW 2 Singapore
AU SHOW 1 Australia
DK SHOW 1 Denmark
HU SHOW 1 Hungary
NZ SHOW 1 New Zealand
PH SHOW 1 Philippines
RU SHOW 1 Russia
UY SHOW 1 Uruguay